Sound 11
☆ Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark CVPR 2024
Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
We present a new dataset called Real Acoustic Fields (RAF) that captures real
acoustic room data from multiple modalities. The dataset includes high-quality
and densely captured room impulse response data paired with multi-view images,
and precise 6DoF pose tracking data for sound emitters and listeners in the
rooms. We used this dataset to evaluate existing methods for novel-view
acoustic synthesis and impulse response generation which previously relied on
synthetic data. In our evaluation, we thoroughly assessed existing audio and
audio-visual models against multiple criteria and proposed settings to enhance
their performance on real-world data. We also conducted experiments to
investigate the impact of incorporating visual data (i.e., images and depth)
into neural acoustic field models. Additionally, we demonstrated the
effectiveness of a simple sim2real approach, where a model is pre-trained with
simulated data and fine-tuned with sparse real-world data, resulting in
significant improvements in the few-shot setting. RAF is the first
dataset to provide densely captured room acoustic data, making it an ideal
resource for researchers working on audio and audio-visual neural acoustic
field modeling techniques. Demos and datasets are available on our project
page: https://facebookresearch.github.io/real-acoustic-fields/
comment: Accepted to CVPR 2024. Project site:
https://facebookresearch.github.io/real-acoustic-fields/
☆ Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment ICLR 2024
We introduce a novel task within the field of 3D dance generation, termed
dance accompaniment, which necessitates the generation of responsive movements
from a dance partner, the "follower", synchronized with the lead dancer's
movements and the underlying musical rhythm. Unlike existing solo or group
dance generation tasks, a duet dance scenario entails a heightened degree of
interaction between the two participants, requiring delicate coordination in
both pose and position. To support this task, we first build a large-scale and
diverse duet interactive dance dataset, DD100, by recording about 117 minutes
of professional dancers' performances. To address the challenges inherent in
this task, we propose a GPT-based model, Duolando, which autoregressively
predicts the subsequent tokenized motion conditioned on the coordinated
information of the music, the leader's and the follower's movements. To further
enhance the GPT's capabilities of generating stable results on unseen
conditions (music and leader motions), we devise an off-policy reinforcement
learning strategy that allows the model to explore viable trajectories from
out-of-distribution samplings, guided by human-defined rewards. Based on the
collected dataset and proposed method, we establish a benchmark with several
carefully designed metrics.
comment: ICLR 2024
☆ A Diffusion-Based Generative Equalizer for Music Restoration
This paper presents a novel approach to audio restoration, focusing on the
enhancement of low-quality music recordings, and in particular historical ones.
Building upon a previous algorithm called BABE, or Blind Audio Bandwidth
Extension, we introduce BABE-2, which incorporates a series of significant
improvements. This research broadens the concept of bandwidth extension to
\emph{generative equalization}, a novel task that, to the best of our
knowledge, has not been explicitly addressed in previous studies. BABE-2 is
built around an optimization algorithm utilizing priors from diffusion models,
which are trained or fine-tuned using a curated set of high-quality music
tracks. The algorithm simultaneously performs two critical tasks: estimation of
the filter degradation magnitude response and hallucination of the restored
audio. The proposed method is objectively evaluated on historical piano
recordings, showing a marked enhancement over the prior version. The method
yields similarly impressive results in rejuvenating the works of renowned
vocalists Enrico Caruso and Nellie Melba. This research represents an
advancement in the practical restoration of historical music.
comment: Submitted to DAFx24. Historical music restoration examples are
available at: http://research.spa.aalto.fi/publications/papers/dafx-babe2/
☆ Fusion approaches for emotion recognition from speech using acoustic and text-based features ICASSP 2020
In this paper, we study different approaches for classifying emotions from
speech using acoustic and text-based features. We propose to obtain
contextualized word embeddings with BERT to represent the information contained
in speech transcriptions and show that this results in better performance than
using GloVe embeddings. We also propose and compare different strategies to
combine the audio and text modalities, evaluating them on IEMOCAP and
MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is
beneficial on both datasets, though only subtle differences are observed across
the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect
that the criteria used to define the cross-validation folds have on results. In
particular, the standard way of creating folds for this dataset results in a
highly optimistic estimation of performance for the text-based system,
suggesting that some previous works may overestimate the advantage of
incorporating transcriptions.
comment: 5 pages. Accepted in ICASSP 2020
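The abstract above compares fusion strategies without detailing them; as one common illustrative example, the sketch below shows score-level (late) fusion, which averages per-class posteriors from the two modalities. The emotion labels and posterior values are made up, and this is not necessarily the paper's exact recipe:

```python
# Minimal sketch of score-level (late) fusion for emotion recognition.
# The class names and example posteriors below are hypothetical.

def late_fusion(acoustic_probs, text_probs, weight=0.5):
    """Weighted average of per-class posteriors from two modalities."""
    assert acoustic_probs.keys() == text_probs.keys()
    fused = {c: weight * acoustic_probs[c] + (1 - weight) * text_probs[c]
             for c in acoustic_probs}
    total = sum(fused.values())  # renormalize to a proper distribution
    return {c: p / total for c, p in fused.items()}

acoustic = {"angry": 0.6, "happy": 0.1, "neutral": 0.2, "sad": 0.1}
text     = {"angry": 0.3, "happy": 0.1, "neutral": 0.5, "sad": 0.1}
fused = late_fusion(acoustic, text, weight=0.5)
print(max(fused, key=fused.get))  # "angry" with these example scores
```

In practice the fusion weight would itself be tuned on a development set, which is one of the choices such comparisons evaluate.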
☆ ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds
Automated Audio Captioning is a multimodal task that aims to convert audio
content into natural language. The assessment of audio captioning systems is
typically based on quantitative metrics applied to text data. Previous studies
have employed metrics derived from machine translation and image captioning to
evaluate the quality of generated audio captions. Drawing inspiration from
auditory cognitive neuroscience research, we introduce a novel metric approach
-- Audio Captioning Evaluation on Semantics of Sound (ACES). ACES takes into
account how human listeners parse semantic information from sounds, providing a
novel and comprehensive evaluation perspective for automated audio captioning
systems. ACES combines semantic similarities and semantic entity labeling. ACES
outperforms similar automated audio captioning metrics on the Clotho-Eval FENSE
benchmark in two evaluation categories.
☆ Noise-Robust Keyword Spotting through Self-supervised Pretraining
Voice assistants are now widely available, and to activate them a keyword
spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using
supervised learning methods and require a large amount of labelled data to
achieve a good performance. Leveraging unlabelled data through self-supervised
learning (SSL) has been shown to increase the accuracy in clean conditions.
This paper explores how SSL pretraining methods such as Data2Vec can be used to
enhance the robustness of KWS models in noisy conditions, a setting that remains
under-explored.
Models of three different sizes are pretrained using different pretraining
approaches and then fine-tuned for KWS. These models are then tested and
compared to models trained using two baseline supervised learning methods, one
being standard training using clean data and the other one being multi-style
training (MTR). The results show that pretraining and fine-tuning on clean data
is superior to supervised learning on clean data across all testing conditions,
and superior to supervised MTR for testing conditions of SNR above 5 dB. This
indicates that pretraining alone can increase the model's robustness. Finally,
it is found that using noisy data for pretraining models, especially with the
Data2Vec-denoising approach, significantly enhances the robustness of KWS
models in noisy conditions.
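The multi-style training (MTR) baseline mentioned above trains on clean speech mixed with noise at varying SNRs. A minimal sketch of that mixing step, assuming plain Python lists of samples (a real pipeline would use array libraries and recorded noise; the toy signals here are stand-ins):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise_scaled) == snr_db, then add."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

# Deterministic toy signals stand in for speech and background noise.
clean = [math.sin(i / 5) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(1000)]
mixed = mix_at_snr(clean, noise, snr_db=10)
```

MTR draws `snr_db` at random per utterance so the model sees a range of noise levels during training.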
☆ Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
Transformers have been the most successful architecture for various speech
modeling tasks, including speech separation. However, the self-attention
mechanism in transformers with quadratic complexity is inefficient in
computation and memory. Recent models incorporate new layers and modules along
with transformers for better performance but also introduce extra model
complexity. In this work, we replace transformers with Mamba, a selective state
space model, for speech separation. We propose dual-path Mamba, which models
short-term and long-term forward and backward dependency of speech signals
using selective state spaces. Our experimental results on the WSJ0-2mix data
show that our dual-path Mamba models match or outperform the dual-path
transformer SepFormer with only 60% of its parameters, and QDPN with only 30% of
its parameters. Our large model also reaches a new state-of-the-art SI-SNRi of
24.4 dB.
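Dual-path models (DPRNN, SepFormer, and by extension dual-path Mamba) share a segmentation scheme: the encoded sequence is folded into overlapping chunks, so that intra-chunk passes model short-term dependencies and inter-chunk passes model long-term ones. A toy sketch of that folding, with illustrative chunk sizes rather than the paper's actual configuration:

```python
# Fold a long sequence into overlapping, zero-padded chunks.
# Intra-chunk processing then runs along each row (short-term structure);
# inter-chunk processing runs down each column (long-term structure).

def segment(seq, chunk_len, hop):
    """Split seq into overlapping chunks of length chunk_len (zero-padded)."""
    chunks = []
    for start in range(0, len(seq), hop):
        chunk = seq[start:start + chunk_len]
        chunk += [0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks

chunks = segment(list(range(10)), chunk_len=4, hop=2)
print(chunks[0], chunks[-1])  # [0, 1, 2, 3] [8, 9, 0, 0]
```

Alternating the two pass directions is what lets a short-context model cover long sequences without quadratic cost.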
♻ ☆ NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
While recent large-scale text-to-speech (TTS) models have achieved
significant progress, they still fall short in speech quality, similarity, and
prosody. Considering speech intricately encompasses various attributes (e.g.,
content, prosody, timbre, and acoustic details) that pose significant
challenges for generation, a natural idea is to factorize speech into
individual subspaces representing different attributes and generate them
individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with
novel factorized diffusion models to generate natural speech in a zero-shot
way. Specifically, 1) we design a neural codec with factorized vector
quantization (FVQ) to disentangle the speech waveform into subspaces of content,
prosody, timbre, and acoustic details; 2) we propose a factorized diffusion
model to generate attributes in each subspace following its corresponding
prompt. With this factorization design, NaturalSpeech 3 can effectively and
efficiently model intricate speech with disentangled subspaces in a
divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the
state-of-the-art TTS systems on quality, similarity, prosody, and
intelligibility, and achieves on-par quality with human recordings.
Furthermore, we achieve better performance by scaling to 1B parameters and 200K
hours of training data.
comment: Achieving human-level quality and naturalness on multi-speaker
datasets (e.g., LibriSpeech) in a zero-shot way
♻ ☆ Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification
Previous studies demonstrate the impressive performance of residual neural
networks (ResNet) in speaker verification. The ResNet models treat the time and
frequency dimensions equally. They follow the default stride configuration
designed for image recognition, where the horizontal and vertical axes exhibit
similarities. This approach ignores the fact that time and frequency are
asymmetric in speech representation. In this paper, we address this issue and
look for optimal stride configurations specifically tailored for speaker
verification. We represent the stride space on a trellis diagram, and conduct a
systematic study on the impact of temporal and frequency resolutions on the
performance and further identify two optimal points, namely Golden Gemini,
which serves as a guiding principle for designing 2D ResNet-based speaker
verification models. By following the principle, a state-of-the-art ResNet
baseline model gains a significant performance improvement on VoxCeleb, SITW,
and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions,
respectively, across different network depths (ResNet18, 34, 50, and 101),
while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to
it as Gemini ResNet. Further investigation reveals the efficacy of the proposed
Golden Gemini operating points across various training conditions and
architectures. Furthermore, we present a new benchmark, namely the Gemini
DF-ResNet, using a cutting-edge model.
comment: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing. Copyright may be transferred without notice, after which this
version may no longer be accessible
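To make the time/frequency asymmetry above concrete, the sketch below shows how a per-stage stride schedule determines the downsampled resolution of each axis. The stage counts and stride values are hypothetical illustrations, not the paper's Golden Gemini operating points:

```python
# Track how (freq, time) resolution shrinks through downsampling stages,
# each specified as a (freq_stride, time_stride) pair.

def resolution_after(stages, freq_dim, time_dim):
    """stages: list of (freq_stride, time_stride) per downsampling stage."""
    for fs, ts in stages:
        freq_dim = -(-freq_dim // fs)  # ceiling division
        time_dim = -(-time_dim // ts)
    return freq_dim, time_dim

# Image-style symmetric strides halve both axes at every stage:
print(resolution_after([(2, 2)] * 3, 80, 200))               # (10, 25)
# An asymmetric schedule reaches the same frequency resolution
# while preserving far more temporal resolution:
print(resolution_after([(2, 1), (2, 1), (2, 2)], 80, 200))   # (10, 100)
```

Searching over such schedules, rather than inheriting the symmetric image-recognition default, is the design space the paper's trellis diagram enumerates.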
♻ ☆ LCANets++: Robust Audio Classification using Multi-layer Neural Networks with Lateral Competition ICASSP
Audio classification aims at recognizing audio signals, including speech
commands or sound events. However, current audio classifiers are susceptible to
perturbations and adversarial attacks. In addition, real-world audio
classification tasks often suffer from limited labeled data. To help bridge
these gaps, previous work developed neuro-inspired convolutional neural
networks (CNNs) with sparse coding via the Locally Competitive Algorithm (LCA)
in the first layer (i.e., LCANets) for computer vision. LCANets learn in a
combination of supervised and unsupervised learning, reducing dependency on
labeled samples. Motivated by the fact that the auditory cortex also relies on
sparse coding, we
extend LCANets to audio recognition tasks and introduce LCANets++, which are
CNNs that perform sparse coding in multiple layers via LCA. We demonstrate that
LCANets++ are more robust than standard CNNs and LCANets against perturbations,
e.g., background noise, as well as black-box and white-box attacks, e.g.,
evasion and fast gradient sign (FGSM) attacks.
comment: Accepted at 2024 IEEE International Conference on Acoustics, Speech
and Signal Processing Workshops (ICASSPW)
♻ ☆ ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
We introduce ÌròyìnSpeech, a new corpus influenced by the desire
to increase the amount of high-quality, contemporary Yorùbá speech
data, which can be used for both Text-to-Speech (TTS) and Automatic Speech
Recognition (ASR) tasks. We curated about 23000 text sentences from news and
creative writing domains with the open license CC-BY-4.0. To encourage a
participatory approach to data creation, we provide 5000 curated sentences to
the Mozilla Common Voice platform to crowd-source the recording and validation
of Yorùbá speech data. In total, we created about 42 hours of speech
data recorded by 80 volunteers in-house, and 6 hours of validated recordings on
the Mozilla Common Voice platform. Our TTS evaluation suggests that a
high-fidelity, general-domain, single-speaker Yorùbá voice is possible
with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline
word error rate (WER) of 23.8%.
comment: Accepted to LREC-COLING 2024
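The WER baseline quoted above is the standard word-level edit-distance metric: substitutions, insertions, and deletions between reference and hypothesis, divided by the reference length. A minimal stdlib sketch (the example sentences are invented):

```python
# Word error rate via word-level Levenshtein distance.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words
```

Production ASR evaluations typically also apply text normalization (casing, punctuation, numerals) before scoring, which this sketch omits.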